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Abstract — Thus far, sparse representations have been exploited 
largely in the context of robustly estimating functions in a noisy 
environment from a few measurements. In this context, the exis- 
tence of a basis in which the signal class under consideration is 
sparse is used to decrease the number of necessary measurements 
while controlling the approximation error. In this paper, we 
instead focus on applications in numerical analysis, by way of 
sparse representations of linear operators with the objective of 
minimizing the number of operations needed to perform basic 
operations (here, multiplication) on these operators. We represent 
a linear operator by a sum of rank-one operators, and show 
how a sparse representation that guarantees a low approximation 
error for the product can be obtained from analyzing an induced 
quadratic form. This construction in turn yields new algorithms 
for computing approximate matrix products. 

I. Introduction 

OPERATIONS on large matrices are a cornerstone of 
computational linear algebra. With a few exceptions 
such as the approximate Lanczos and power methods, most 
algorithms used by practitioners aimed to optimize speed of 
computation under the constraint of obtaining an exact result. 
Recently, spurred by the seminal paper of Frieze et al. [1], 
there has been a greater interest in finding algorithms which 
sacrifice the precision of the result for a gain in the speed of 
execution. 

Consider the low-rank approximation problem; i.e., finding 
a matrix Af. of rank at most k which approximates a given 
matrix A. The best matrix A k , best in the sense that it 
minimizes \\A — A k \\ for any unitarily invariant norm (e.g., 
spectral or Frobenius norms), can be obtained by computing 
the singular value decomposition (SVD) of A. (Throughout 
this paper, we adopt the Frobenius norm; we use the nota- 
tion Ak to denote the best rank-fc approximation to A and 
Ak to denote an approximation to it — it will be easy to 
avoid confusion with Ai, which is used to denote the 
column of A) But in some instances, evaluating the SVD, 
which scales as C(n 3 ) where n is the largest dimension 
of A, may be too costly. Frieze et al. in [1] showed that 
Ak can be reasonably well approximated by computing the 
SVD of a subset of the columns of A only, where the 
columns are sampled according to their relative powers — 
i.e., Pr(pick column i) oc ||Ai|| 2 /^ ||^4i|| 2 , with the expected 
error coming from using the approximation A k instead of A/, 
being of the form E \\A - A k \\ 2 < \\A - A k \\ 2 + e\\A\\ 2 . In 
subsequent papers it has been argued that the additive error 
term in \\A\\ 2 may be large, and thus other sampling techniques 
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have been introduced to obtain relative approximation error 
bounds (see, e.g., [2], [3]). 

In this paper, we address the sparse representation of 
linear operators for the approximation of matrix products. An 
important object that will appear in our study is the so-called 
Nystrom method (see Section II) to find a low-rank approxi- 
mation to a positive kernel. This method, familiar to numerical 
analysts, has nowadays found applications beyond its original 
field, most notably in machine learning. In previous work [4], 
we proposed an approach for low-rank approximation in such 
applications; here we show that the task of finding optimal 
sparse representation of linear operators, in order to evaluate 
their product, is related to the Nystrom extension of a certain 
positive definite kernel. We will the use this connection to 
derive and bound the error of two new algorithms, which our 
simulations indicate perform well in practice. 

Related to our work is that of [5] for matrix products. In 
that paper, Drineas et al. showed that a randomized algorithm 
sampling columns Ai and rows of A and B in proportion 
to their relative powers \\Ai\\ 2 and H-E^ H 2 yields an expected 
error of E \\AB - AB\\ 2 < const • fc" 1 1| A|| 2 ||B|| 2 . Notice that 
this bound does not involve a low-rank approximation of A 
or B. In contrast, we obtain a randomized algorithm bound in 
which the approximating rank k of a kernel related to A and 
B appears explicitly. 

The methods mentioned above are all adaptive, in the sense 
that they require some knowledge about A and B. A very 
simple non-adaptive method is given by an application of the 
Johnson-Lindenstrauss Lemma: it is easy to show [6] that if 
W is a k x n matrix with independent unit Normal entries and 
x, y e K", then for < e < 1 we have that 

Y>Y(\{x,y)-k- 1 {Wx,Wy)\ < e||a;||||?/||) > l-4e _ *<£-£>. 

Letting A 1 denote the i th row of A and Bj the j th column of 
B, we observe the following element-wise relation: 

(AB)ij - k- 1 (AW T WB) ij = (A\Bj) — k~ 1 (WA l , WBj), 

and thus we see that approximating AB by k~ 1 AW T WB 
yields a good result with high probability. Later we compare 
this method to our algorithms described below. 

The remainder of this paper is organized as follows. In 
Section II, we briefly review the Nystrom method used to 
approximate positive definite matrices [7], [8]. In Section III, 
we introduce the problem of approximating a matrix product 
and highlight two key aspects: the issue of best subset se- 
lection and the issue of optimal rescaling. We then solve the 
optimal rescaling problem and analyze a randomized and a 
deterministic algorithm for subset selection; we conclude with 
simulations and a brief discussion of algorithmic complexity. 
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II. Approximation Via The Nystrom Method 

To provide context for our results, we first introduce the 
so-called Nystrom method to approximate the eigenvectors of 
a symmetric positive semi-definite (SPSD) matrix. 



A. The Nystrom Method for Kernel Approximation 

The Nystrom method, familiar in the context of finite 
element methods, has found many applications in machine 
learning and computer vision in recent years (see, e.g., [8] 
and references therein). We give here a brief overview : Let 
k : [0, 1] x [0, 1] — > R be a positive semi-definite kernel 
and (Aj, /»), i = 0, 1, . . . , M, denote pairs of eigenvalues and 
eigenvectors such that 



[0,1] 



k{x,y)fi(y)dy = Xifi{x). 



(1) 



The Nystrom extension is a method to approximate the eigen- 
vectors of k(x,y) based on a discretization of the interval 
[0, 1]. Define the M + l points Xi by x m = x m -\ + 1/M with 
x = 0, so that the Xi's are evenly spaced along the interval 
[0,1]. Then form the Gram matrix K mn := k(x m , x n ), which 
in turn is used to approximate (1) by a finite-dimensional 
spectral problem 



M 



^—^K mn v i (n) = \ v i v i (rn), i = 0,l,...,M. 



The Nystrom extension then uses these Vi to give an estimate 
fi of the i th eigenfunction as follows: 



fi(x) = 



(M + 1)A 



-^2k(x,x m )vi(m). 



This method can also be applied in the context of matrices. 
Let Q be an n x n SPSD matrix, partitioned as 



Q 



Qj y 

Y T Z 



where Qj e R kxk and k is typically much smaller than n. It is 
then possible to approximate k eigenvectors and eigenvalues of 
Q by using the eigendecomposition of Qj as follows. Define 
Q = UAU T and Qj = UjAjUj T with U, Uj orthogonal and 
A, A j diagonal. The Nystrom extension then tells us that an 
approximation for k eigenvectors in U is given by 



U = 



Uj 



These approximations U ~U and Aj ~ A in turn yield an 
approximation Q to Q as follows: 



Q = UKjU 1 



Qj Y 
Y T Y T Q J ~ 1 Y 



The quality of this approximation can then be measured as the 
(e.g., Frobenius) norm of the Schur complement of Qj in Q: 

\\Q-Q\\ = \\Z-Y T Q J - 1 Y\\. 



III. Approximation of Matrix Products 

Let A e R mxn and B e R nxp . We use the notation A l to 
denote the columns of A and B % the rows of B. We can write 
the product AB as the sum of rank-one matrices as follows: 



AB = s ^A i B\ 



(2) 



i=l 



Our approach to estimate the product AB, akin to model 
selection in statistics, will consist of keeping only a few terms 
in the sum of (2); this entails choosing a subset of columns 
of A and of rows of B, and rescaling their outer products as 
appropriate. To gain some basic insight into this problem, we 
may consider the following two extreme cases with A eR nx2 
and B = A T . First, suppose that the vectors A\ and A2 are 
collinear. Then AB = AxAj+aAxAj = (1+a) A-lAJ, hence 
we can recover the product without error by only keeping one 
term of the sum of (2) and rescaling it appropriately provided 
that we know the correlation between A\ and A 2 . At the 
other extreme, if A\ and A 2 are orthogonal, rescaling will 
not decrease the error no matter which term in (2) is kept. 

Hence we see that there are two key aspects to the problem 
of sparse matrix product approximation as formulated above: 

Optimal Model Selection 

Which rows/columns should be retained? 

Optimal Reweighting 

How should these rows/columns be rescaled? 

As we show below, the latter of these problems can be 
solved exactly for a relatively low complexity. For the former, 
which is combinatorial in nature and seemingly much harder 
to solve, we give an efficient approximation procedure. 

IV. Solving the Optimal Reweighting Problem 

We first consider the problem of optimal reweighting, con- 
ditioned upon a choice of subset. In particular, suppose that an 
oracle gives us the best subset J C {1, . . . , n} of cardinality 
k to estimate the product AB. Without loss of generality, we 
assume that J = {1, . . . , k}. We then have the following result 
characterizing how well one can estimate the product AB: 

Theorem 1. Let the n x n SPSD matrix Q be defined as 
Qij = (A4, Aj)(B i , B j ) (i.e., Q = (A T A) (BB T ), where 
is the Hadamard or entrywise product of matrices) and have 
the partition 

' Qj Y 
Y T Z 



Q 



(3) 



where J = {1, . . . , k} without loss of generality, and Qj is 
the corresponding principal submatrix. 

Then the best approximation to the product AB using the 
terms {AiB l } i£ j is given by 



where 
and 



AB w AB — ^WiAiB 1 , 
ieJ 

w := Q J ~ 1 r 

n 

:=J2(^,A 3 )(B\BJ), teJ. 
3=1 



(4) 
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Moreover, if E is the (n — k) x (n — k) matrix with all We now rewrite (5) in a more manageable form: 
entries equal to one, then the squared approximation error 

in Frobenius norm is given by \\AB - AB\\ 2 = \\AB\\ 2 - 2(w, r) + w T Qjw. (7) 

\\AB — AB\\ — ti(Sc(Qj)E), -phe optimal weight vector is now obtained by setting the 

•./ c \ ry V Tr^ -iv w c / ; . e gradient of (7) to zero. Hence we obtain 

with bc(Qj) := Z — Y Qj Y the Schur complement oj ° 

Q jin ®- w = Q J ~ 1 r, 

This result tells us how well we can approximate the product 

AB granted that we know only a few rows/columns of A and which proves the first part of the statement. 
B, and their correlations with the remaining rows and columns. For the second part, first notice that if [1] is the vector whose 

It also allows us to characterize the best subset J of size k; it entries are all one, we have the following expression for r: 
is the subset that minimizes tr(Sc(Qj)E). 

Proof: Given the subset J of {1, ... , n}, we seek the best r=[Qj Y}[1}. 

scaling factors Wi to minimize the squared approximation error 

\\AB - AB\\ 2 . We can write the squared error as Hence ' at the optimum, the error is 

\\AB -AB\\ 2 = \\AB\\ 2 [lf[Qj Y\ t Qj-\Qj Y}[1] 
\\AB-J2 mA^\\ 2 = = ||^ B ||2 _ ^tq^ 

»=i 

(k k \ ~ 

^ B i ) T (AB A B i ) I where we see that Q is the Nystrom approximation of Q as 

^ 11 ' ^ 11 ' J described in Section II. Using Lemma 1 below, we have 

By distributing the product and using the linearity of the trace, \\AB — AB\\ 2 = [1] T (Q — Q)[l], 

we get 



(k k \ 

(AB - WiA i B i ) T (AB - w l A l B l ) j 
i=i »=i / 



which finishes the proof of the Theorem. 



The proof of the second part of Theorem 1 is based on the 
identity proven below: 

tr ((AB) T AB) — 2 ^ u>;tr ((AB) T AiB 1 ) Lemma 1. Let A and B be real matrices of dimensions mxn 

1=1 and n x p, respectively, and let E be the n x n matrix with 

all entries equal to one. The following identity holds: 



+ tr £ w t A t BY(J2 w ^ Bl ) > ( 5 ) 



||AB|| 2 =tr((A T A)Q(BB T )E). (8) 



\ i=l i=l / 

where we made use of the following equality: 

T t i t\ Proof: Recall that we can write the product AB as a sum 

tr {(AB) A,B l ) = tr (((AB) A t B l ) ) of n rank . one terms as foUows: 

= tv{(A i B i ) T AB). 

AB = \ A ■ R* 

We now work towards rewriting (5) in a more manageable ~~ 1 ' 

form. First, using the fact that tr(AB) = ti(BA), we see that 1-1 

tr(A B i ) — (B i A-) (6) ^ e ^ US nave ' definition of the Frobenius norm, that 

By combining (2) and (6), we have \\AB\\ 2 = tr((AB) T AB) 



= tv^A.BY^A^)) 

* 3 

tv^BYAfAjBi) 



k kin 

5> 4 tr {(ABfA^) = £>,-tr ^(BifAjAiB* 
i=\ »=i \j=i 

k n l 3 

= Y,w i Y i {B i ,Bi){A i ,A j ) = {w,r). = £ tr((£T) T Af AjB*). 

i=i j=i ij 

Similarly, using (6), we get after an easy computation that Using ±e invariance of the trace with respect t0 cyclic 

(k k \ permutations, the last equation yields 

S K U J \\AB\\ 2 = Y,(AuA i ){B\B>), 

= Y i w i w i {(A i ,A i ) (B\Bi)) = w T Qjw. 13 

*i and the relation (8) is proved. ■ 
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A. Approximating the Optimal Subset Selection Procedure 

Having shown a solution to the optimal reweighting problem 
according to Theorem 1, we now turn our attention to the 
companion problem of optimal subset selection. In order to 
minimize the approximation error, we have to find the subset 
J whose associated Schur complement Sc(Q.j) has the lowest 
possible power along the one-dimensional subspace of R n ~ k 
spanned by the vector [1]. Determining the eigenvectors and 
eigenvalues of this Schur complement, and relating them 
to A and B, is not an easy task. Here we present two 
approximations: one based on a random choice of subsets, 
and an alternative "greedy" approach which yields a worst- 
case error bound. 

1 ) Random Choice of Subset: We first discuss a random 
oracle which outputs a subset J with probability PQ,k{J) 
defined below. Recall our earlier definition of the matrix 
Q = A T A BB T according to Theorem 1; this approach 
is motivated by the expression of the resultant squared error, 
conditioned_upon having chosen a subset J C {l,...,n}, 
as || AB - AB|| 2 = tr(S c (Qj)E). Since S c (Q,j) is positive 
definite, we have that tr(Sc(Qj)) is larger than the largest 
eigenvalue of Sc(Q.j), and we can bound this error as follows: 

\\AB-AB\\ 2 < (n-k)tY(Sc(Qj)). (9) 

Note that equality is obtained when Sc(Q.j) oc E, and hence 
this bound is tight. We have investigated in [9] an algorithm 
to minimize ||Sc(Q./)||, wmcn nas been shown to be effective 
in the context of low-rank covariance matrix approximation. 

Returning to our random oracle, note that both A T A and 
BB T are positive definite, and thus by the Schur Theo- 
rem [10], Q is also positive definite. From this, we conclude: 

1) There exists a matrix X e R nxn such that Q = X T X; 

2) All the principal minors det(Qj) of Q are positive. 
Consequently, we assume here that an oracle returns a subset 
J C {1, . . . , n}, | J\ — k with probability 

PQ,k(J) ■= ^det(Qj), (10) 

where K = >~2j \ j\=k det(<2 j) is a normalizing constant, and 
the second fact above ensures that this probability distribution 
is well defined. We may then adapt the following result from 
the proof of Theorem 1 in [4] : 

Theorem 2. Let Q € W ixn be a positive quadratic form with 
eigenvalues Ai > A2 > . . . > A„. If J C {1, . . . , n}, | J| = k 
is chosen with probability pq^{J) oc det(Qj), then 

n 

Etr(S c (Qj)) < (fc + 1) Xi - 

i=k+l 

Combining (9) with (11) leads, via Jensen's inequality, to an 
upper bound on the average error of this approach to random 
subset selection: 

E || AB - AB|| < y/(n-k)(k+l)\\X - X k \\, 

where X is defined via the relation X T X = A t AqBB t , and 
Xk denotes the optimal rank-fc approximation to X obtained 
by truncating its singular value decomposition. 



Despite the appearance of the term \Jn — k in this bound, 
it serves to relate the resultant approximation quality to the 
ranks of A and B, a feature reinforcing the well-foundedness 
of the accompanying algorithm we present below. In particular, 
if k > rank(A) rank(_B), then the approximation error is zero 
as expected. For practical reasons, we may also wish to relate 
this error to the eigenvalues of A and B. To this end, let M 
and N be two n x n matrices, P = M N, and let Ui{M) 
(resp. (7i(N), tJi(P)) be the singular values of M (resp. N, 
P) sorted in non-increasing order. We then have the following 
majorization relation [11]: 

m m 

5>*(^) <Y,ai(M)ai(N), for m = 1, 2, . . . , n. 

i=i i=i 

In particular, if M = A T A, N = BB T , and Q = X T X = 
M N, then the singular values of Q, M, and N are the 
squares of the singular values of X, A and B respectively: 

<J2°!(A)aj(B), form = l,...,n. (12) 

We may then conclude from (12) that 

\\X X k \\ 2 < min (al(A)\\B\\ 2 , a?(B)||A|| 2 ) - ||X fe || 2 . 

Although the approach presented above relies on an oracle 
to sample in proportion to det(<5j), we will subsequently 
outline a realizable algorithm based on these results. 

2) Deterministic Choice of Subset: Recall that Theorem 1 
indicates we should ensure that the diagonal terms of Z 
are kept as small as possible. Hence, as a deterministic 
approximation to the optimal subset selection procedure, we 
may take J such that it contains the indices of the k largest 
terms {A i ,A i )(B l ,B' 1 ). While yielding only a worst-case 
error bound, this approach has the advantage of being easily 
implementable (as it does not require sampling according to 
det(Qj)); it also appears to perform well in practice [9]. This 
greedy algorithm proceeds as follows: 



Algorithm 1 Greedy Approximate Matrix Multiplication 
Given matrices A e K mx ™ an d B e W lXp and a positive 
integer k < n: 

1) Set T := {{A i ,A i )(B i ,B i )}, i = 1, . . . , n, and take 
J := ifc} to be the indices of the k largest 
elements of T. 

2) Set Q e R kxk as := (A, A?) { B \ Bj ), for hi G 
J. 

3) Set rel'as r, := £" =1 (A, A,-) (B\ B^), for i e J. 

4) Set w ■■=ST 1 r and Aj := {A}, Bj := {B 1 } for i e J. 

5) Return AB := Aj dia,g(w)Bj as an approximation to 
AB. 



Since the error term is the sum of all the terms in the Schur 
complement, we can look to bound its largest element. To this 
end, we have the following result: 

Lemma 2. The largest entry in Sc(Qj) is smaller than the 
largest diagonal element of Z in (3). 
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Fig. 1. Matrix product approximation error using the power rescaling of (15) Fig. 2. Matrix product approximation error using the optimal rescaling of 
applied to each of the four subset selection algorithms described in Sec. V-A Theorem 1 applied to the subset selection algorithms described in Sec. V-A 



This lemma confirms that a good error-minimization strat- 
egy is to make sure that the diagonal terms of Z are as small 
as possible, or equivalently to take J such that it contains the 
indices of the k largest (Ai, Ai) (B l , B l ) as per Algorithm 1. 

The proof of Lemma 2 is based on the following set of 
simple results: 

Lemma 3. If Q is a positive definite matrix, then maxy Qij 
is positive and on the diagonal of Q. 

Proof: Since Q is positive definite, we know there exists 
a matrix I e I" x " such that 

Qij = \Xi,Xj) . 

By the Cauchy-Schwartz inequality, we have 

(X X .X 3 ) 2 < {X^XJiX^Xj}, 

from which we deduce that one of the following inequalities 
has to be satisfied: 



(X h Xj) < (Xi,Xi) or {Xi,X-) < (X^Xj) 



(13) 



Now if we suppose that maxy Qij is not a diagonal element, 
the relations of (13) yield a contradiction — and hence the 
largest entry of Q is on its main diagonal. ■ 

The entries of Sc(Qj), the Schur complement of Qj in 
Q, can be characterized explicitly according to the following 
formula: 

Lemma 4 (Crabtree-Haynsworth [12]). LetQj = Qi,...,k;i...,k 
be a nonsingular leading principal submatrix of Q obtained 
by keeping the rows and columns with indices 1, . . . , k. Then 
Sc(Qj), the Schur complement of Qj in Q, is given element- 
wise by 

det(Qi, 



(Sc(Qj)h 



,k,i;l, 



,k,j) 



(14) 



det(Qj) 

Furthermore, it is possible to bound the diagonal entries of 
Sc(Qj) as follows: 



Lemma 5 (Fischer's Lemma [10]). If Qj is a positive definite 
matrix, then 

det(Q JU{j} ) < det(Qj)Qu. 

We are now ready to give the proof of Lemma 2: 

Proof of Lemma 2: The preceding two lemmas tell us that 
the diagonal entries of Sc(Qj) are bounded by max^jQ^ 
(i.e., the largest diagonal element of Z, according to the 
partition of (3)). And using Lemma 3, we know that every 
entry of Sc(Qj) is bounded by these diagonal entries. ■ 

Lemma 2 can be further refined to give a worst-case 
error bound for deterministic matrix product approximation, 
conditioned on a choice of subset J and the corresponding 
optimal reweighting procedure. Appealing to the inequality of 
arithmetic and geometric means to further bound the elements 
of Sc(Qj), the results of Theorem 1 and Lemmas 3-5 yield: 



\AB-AB\\< (n-k) 



\M 2 \m\ 2 ) 



V. Simulation Studies and Complexity 

A. Experimental Results 

We now present preliminary experimental results and dis- 
cuss the computational complexity of the algorithms under 
consideration. Three sets of experiments were performed, in 
which we compared the performance of four subset selection 
methods: a baseline uniform sampling on fc-subsets; sampling 
according the row/column powers [5]; sampling in proportion 
to the fc-principal minors of Q according to (10); and selecting 
greedily according to Step 1 of Algorithm 1. We also compared 
the choice of reweighting following subset selection, in one 
case applying the optimal reweighting of Theorem 1 and in the 
other simply reweighting according to the row/column powers 
(see [1], [5]): 

AB = 7 rr .' - Wr. (15) 

teJ Vl-'l 



\Ai 



\B i 
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To test these various experimental conditions, we drew 
200 random matrices A e and B g M i5x90 in 

total, each having independent unit Normal entries. We then 
averaged the error of the randomized algorithm over 20 
trials per matrix product, and report relative error in dB as 
201og 10 (\\AB - AB||/(||A||||B||)) for each test condition. 

In the first set of experiments, shown in Figure 1, we 
compare the four different algorithms for subset selection 
described above, applied in conjunction with a reweighting 
according to row/column powers. The highest-error method in 
this case corresponds to choosing the subset J uniformly at 
random, and thus should be understood as a baseline measure 
of performance as a function of approximant rank k. It can also 
be seen that sampling J according to the relative powers of 
the row/columns of A and B, and sampling via a Metropolis- 
Hastings algorithm (with independent proposal distributions 
taken in proportion to row/column powers), yield similar 
results, with both improving upon the baseline performance. 
The best results in this case are obtained by the greedy subset 
selection method indicated by Step 1 of Algorithm 1. 

In a second set of experiments, we followed the same 
procedure as above to compare subset selection procedures, 
but applied the optimal reweighting of Theorem 1 rather 
than a rescaling according to row/column powers. Perfor- 
mance in this case is (as expected) seen to be better overall, 
but with the ordering of the methods unchanged. As our 
final experiment, we compare the method of Algorithm 1 
(greedy subset selection followed by optimal rescaling) to 
two non-adaptive methods: choosing row/columns of A and 
B uniformly at random and rescaling according to n/k, and 
the simple Johnson-Lindenstrauss random projection approach 
outlined in Section I. These non-adaptive methods can be seen 
to yield significantly worse performance than Algorithm 1, 
suggesting its potential as a practical method of selecting 
sparse representations of linear operators that yield low ap- 
proximation errors for the resultant matrix products. 

We conclude with a brief discussion of the algorithmic com- 
plexity of Algorithm 1 . First, assume without loss of generality 
that m > n,p, and recall that straightforward matrix multipli- 
cation requires 0(m 3 ) operations, though the best algorithm 
known so far (the Coppersmith- Winograd algorithm [13]) 
can perform this operation in C(m 2 38 ). Evaluating T in 
Algorithm 1 requires the computation of 2n inner products 
of m-or p-dimensional vectors, and hence requires (D(2mn) 
operations. Extracting the k largest elements of a set of size m, 
as is necessary to construct J, can be done efficiently using a 
variation on the Quicksort algorithm (see [14]) in C(mlog k). 
The matrix Q is symmetric and its diagonal is a restriction 
of T. Hence it requires the computation of an additional 
2 x k(k~ l)/2 inner products, and thus 0(mk(k— 1)) opera- 
tions. Evaluating r requires 0(2m(n — k)) operations, taking 
into account the fact that k terms of the sum also appear in Q. 
Finally, evaluating w can be done using Gaussian elimination 
in 0(k 3 ) operations. Hence the overall complexity is given by 
C(m(fc(fc-l) + 2(2n-fc)+logfc) + fc 3 ) = 0(m,(n+k 2 ) + k 3 ). 



it.*. 
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Fig. 3. Matrix product approximation error using non-adaptive random 
projections (Johnson-Lindenstrauss), non-adaptive subset selection (uniform), 
and adaptive subset selection (Algorithm 1) 
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